candidate plan
Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning
Fan, Shanwei, Zhang, Bin, Xu, Zhiwei, Teng, Yingxuan, Dai, Siqi, Cheng, Lin, Fan, Guoliang
Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game "Crafter" demonstrate the effectiveness of our proposed method.
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > District of Columbia > Washington (0.04)
- Asia > China > Shandong Province (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
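To make the actor-critic-refiner pipeline above concrete, here is a minimal Python sketch of a generate-critique-refine loop over a subgoal graph. The toy graph, the stub "LLM" functions, and the repair heuristic are all illustrative assumptions, not the authors' implementation; in the paper, separate LLMs play the three roles.

```python
# A toy subgoal graph for Crafter-like tasks: values are prerequisites.
SUBGOAL_GRAPH = {
    "collect_wood": [],
    "make_wood_pickaxe": ["collect_wood"],
    "collect_stone": ["make_wood_pickaxe"],
    "make_stone_pickaxe": ["collect_stone"],
}

def actor_propose(task):
    """Stand-in for the generator LLM: propose a subgoal sequence."""
    return ["collect_wood", "make_stone_pickaxe"]  # skips two prerequisites

def critic_review(plan):
    """Stand-in for the critic LLM: flag subgoals whose prerequisites
    are not satisfied earlier in the plan."""
    done, issues = set(), []
    for sg in plan:
        missing = [p for p in SUBGOAL_GRAPH.get(sg, []) if p not in done]
        if missing:
            issues.append((sg, missing))
        done.add(sg)
    return issues

def refiner_fix(plan, issues):
    """Stand-in for the refiner LLM: splice missing prerequisites in."""
    fixed = list(plan)
    for sg, missing in issues:
        idx = fixed.index(sg)
        fixed[idx:idx] = [m for m in missing if m not in fixed[:idx]]
    return fixed

plan = actor_propose("craft a stone pickaxe")
for _ in range(5):  # iterate critique -> refine until the critic is satisfied
    issues = critic_review(plan)
    if not issues:
        break
    plan = refiner_fix(plan, issues)
print(plan)  # ['collect_wood', 'make_wood_pickaxe', 'collect_stone', 'make_stone_pickaxe']
```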
Efficient Diffusion Planning with Temporal Diffusion
Guo, Jiaming, Zhang, Rui, Li, Zerun, Gao, Yunkai, Peng, Shaohui, Lan, Siming, Hu, Xing, Du, Zidong, Zhang, Xishan, Li, Ling
Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast, humans might create detailed short-term plans and more general, sometimes vague, long-term plans, and adjust them over time. Inspired by this, we propose the Temporal Diffusion Planner (TDP) which improves decision efficiency by distributing the denoising steps across the time dimension. TDP begins by generating an initial plan that becomes progressively more vague over time. At each subsequent time step, rather than generating an entirely new plan, TDP updates the previous one with a small number of denoising steps. This reduces the average number of denoising steps, improving decision efficiency. Additionally, we introduce an automated replanning mechanism to prevent significant deviations between the plan and reality. Experiments on D4RL show that, compared to previous works that generate new plans every time step, TDP improves the decision-making frequency by 11-24.8 times while achieving higher or comparable performance.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (4 more...)
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.46)
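A toy sketch of TDP's core scheduling idea, assuming a stand-in one-step denoiser (`denoise_once` is hypothetical; the paper uses a learned diffusion model): near-term plan elements receive more denoising steps than far-term ones, and each environment step tops the shifted plan up with only one extra step instead of regenerating a full plan.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # plan horizon

def denoise_once(x, target):
    """One fake denoising step: move 20% of the way toward the target."""
    return x + 0.2 * (target - x)

target = np.linspace(0.0, 1.0, H)   # the "true" trajectory to recover
plan = rng.normal(size=H)           # pure-noise initial plan

# Initial plan: earlier elements get more denoising steps than later ones,
# so the plan grows progressively vaguer over the horizon.
for t, budget in enumerate(np.linspace(10, 1, H).astype(int)):
    for _ in range(budget):
        plan[t] = denoise_once(plan[t], target[t])

# At each env step: execute plan[0], shift the plan forward, append a fresh
# noisy element, and give every element just one extra denoising step.
for env_step in range(3):
    action = plan[0]
    plan = np.append(plan[1:], rng.normal())
    target = np.append(target[1:], target[-1] + 1.0 / (H - 1))
    plan = denoise_once(plan, target)  # one cheap update, not a full replan
print(np.round(plan, 2))
```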
Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Xiong, Siheng, Payani, Ali, Fekri, Faramarz
OpenAI's o1 series (OpenAI, 2024) introduces inference-time scaling by increasing the length of the Chain-of-Thought (CoT) (Wei et al., 2022) reasoning process. Despite their empirical success, RL approaches that generate the entire reasoning chain in a single forward pass face notable limitations, including CoT derailment, where the reasoning trajectory drifts off course due to accumulated errors, and the inherent challenges of long-horizon RL with sparse outcome rewards: CoT trajectories can be lengthy and the position of the first error varies considerably, making outcome-based RL fine-tuning inefficient. This sequential scaling strategy, i.e., simply extending the CoT length, can therefore be insufficient (Yang et al., 2025). To improve planning quality, we introduce Multi-Path Plan Aggregation (MPPA). For each planning step, the model generates multiple alternative plans and aggregates them into an improved plan before proceeding to the subsequent execution steps. Beyond enhancing planning, we identify a fundamental challenge in credit assignment for long-horizon policy learning (Kaelbling et al., 1996): existing RL fine-tuning frameworks struggle to provide effective process-level supervision (Guo et al., 2025). First, evaluating the correctness of intermediate steps is inherently difficult, and automated annotation using LLM judges (Gu et al., 2024) often yields unreliable or noisy signals. Second, introducing a separate process reward model (PRM) adds complexity. Instead, we define the process preference between two candidate continuations at the same step by comparing their incremental log-weights, and repurpose Twisted Sequential Monte Carlo (TSMC) to provide scalable, process-level preferences for online Step-DPO training using small LMs. This yields more efficient training, improved stability, and higher accuracy. Results show that our approach consistently outperforms both distillation-based long-CoT methods and RL methods that rely solely on outcome rewards.
- Workflow (0.87)
- Research Report > New Finding (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.44)
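A minimal sketch of the plan-aggregation step described above, with a position-wise majority vote standing in for the paper's LLM-based aggregator; the hard-coded candidate plans are stand-ins for stochastic LLM decodes.

```python
from collections import Counter

def sample_plans(problem, k=5):
    """Stand-in for k stochastic LLM decodes of a step-by-step plan."""
    return [
        ["isolate x", "divide by 3", "check solution"],
        ["isolate x", "divide by 3", "check solution"],
        ["expand terms", "divide by 3", "check solution"],
        ["isolate x", "divide by 3", "verify"],
        ["isolate x", "subtract 3", "check solution"],
    ][:k]

def aggregate(plans):
    """Position-wise majority vote over candidate plans."""
    depth = max(len(p) for p in plans)
    merged = []
    for i in range(depth):
        votes = Counter(p[i] for p in plans if i < len(p))
        merged.append(votes.most_common(1)[0][0])
    return merged

plans = sample_plans("solve 3x + 3 = 12")
print(aggregate(plans))  # ['isolate x', 'divide by 3', 'check solution']
```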
Robust Planning for Autonomous Driving via Mixed Adversarial Diffusion Predictions
We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors, or hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario. Predicting agent behaviors [1]-[4] is a key part of the autonomous driving pipeline. Recently, deep learning-based motion prediction methods [5]-[7] have predicted accurate multimodal distributions of agent behavior.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Transportation > Ground > Road (1.00)
- Information Technology > Robotics & Automation (0.91)
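The scoring rule above reduces to an expected cost under a mixture distribution. A one-dimensional numpy sketch, with toy trajectories and a toy collision cost in place of the paper's diffusion model and cost function:

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(plan_pos, agent_pos, radius=1.0):
    """Toy cost: 1 if the agent ever comes within `radius` of the plan."""
    return float(np.min(np.abs(plan_pos - agent_pos)) < radius)

plan = np.linspace(0.0, 10.0, 20)                       # candidate ego trajectory
normal_preds = rng.normal(15.0, 2.0, size=(100, 20))    # agents mostly far away
adv_preds = plan + rng.normal(0.0, 1.0, size=(30, 20))  # biased to collide

lam = 0.1  # mixture weight on the adversarial distribution
normal_cost = np.mean([cost(plan, p) for p in normal_preds])
adv_cost = np.mean([cost(plan, p) for p in adv_preds])
score = (1 - lam) * normal_cost + lam * adv_cost
print(f"expected mixture cost: {score:.3f}")
```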
SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science
Seo, Wonduk, Lee, Juhyeon, Bu, Yi
Large Language Models (LLMs) have revolutionized automated data analytics and machine learning by enabling dynamic reasoning and adaptability. While recent approaches have advanced multi-stage pipelines through multi-agent systems, they typically rely on rigid, single-path workflows that limit the exploration and integration of diverse strategies, often resulting in suboptimal predictions. To address these challenges, we propose SPIO (Sequential Plan Integration and Optimization), a novel framework that leverages LLM-driven decision-making to orchestrate multi-agent planning across four key modules: data preprocessing, feature engineering, modeling, and hyperparameter tuning. In each module, dedicated planning agents independently generate candidate strategies that cascade into subsequent stages, fostering comprehensive exploration. A plan optimization agent refines these strategies by suggesting several optimized plans. We further introduce two variants: SPIO-S, which selects a single best solution path as determined by the LLM, and SPIO-E, which selects the top k candidate plans and ensembles them to maximize predictive performance. Extensive experiments on Kaggle and OpenML datasets demonstrate that SPIO significantly outperforms state-of-the-art methods, providing a robust and scalable solution for automated data science tasks.
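A compact sketch of the cascaded candidate-strategy search that SPIO-S and SPIO-E perform, with hard-coded module candidates and a deterministic toy scorer in place of the LLM planning agents and cross-validation:

```python
from itertools import product

candidates = {
    "preprocess": ["impute_mean", "impute_knn"],
    "features":   ["poly2", "target_encode"],
    "model":      ["xgboost", "logreg"],
}

def score(pipeline):
    """Stand-in for cross-validated performance of one full pipeline."""
    return sum(len(step) for step in pipeline)  # deterministic toy score

# Per-module candidates cascade into full solution paths.
pipelines = list(product(*candidates.values()))
ranked = sorted(pipelines, key=score, reverse=True)

spio_s = ranked[0]   # SPIO-S: the single best solution path
spio_e = ranked[:3]  # SPIO-E: top-k paths whose predictions get ensembled
print("SPIO-S:", spio_s)
print("SPIO-E:", spio_e)
```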
System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles
Chakraborty, Kaustav, Feng, Zeyuan, Veer, Sushant, Sharma, Apoorva, Ivanovic, Boris, Pavone, Marco, Bansal, Somil
The safety-critical nature of autonomous vehicle (AV) operation necessitates the development of task-relevant algorithms that can reason about safety at the system level, not just at the component level. To reason about the impact of a perception failure on overall system performance, such task-relevant algorithms must contend with several challenges: the complexity of AV stacks, high uncertainty in the operating environments, and the need for real-time performance. To overcome these challenges, in this work we introduce a Q-network called SPARQ (Safety evaluation for Perception And Recovery Q-network) that evaluates the safety of a plan generated by a planning algorithm, accounting for perception failures that the planning process may have overlooked. This Q-network can be queried during system runtime to assess whether a proposed plan is safe for execution or poses potential safety risks. If a violation is detected, the network can then recommend a corrective plan while accounting for the perceptual failure. We validate our algorithm using the NuPlan-Vegas dataset, demonstrating its ability to handle cases where a perception failure compromises a proposed plan while the corrective plan remains safe. We observe an overall accuracy and recall of 90% while sustaining a frequency of 42 Hz on the unseen testing dataset. We compare our performance to a popular reachability-based baseline and analyze some interesting properties of our approach that improve the safety of an AV pipeline.
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
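The runtime-monitor pattern described above can be sketched as follows; the untrained linear "Q-network" and the randomly generated plans are placeholders, not SPARQ itself:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=8)  # weights of an untrained stand-in "Q-network"

def q_value(obs, plan):
    """Scalar safety score for an (observation, plan) pair."""
    return float(W @ np.concatenate([obs, plan]))

def monitor(obs, proposed, fallbacks, threshold=0.0):
    """Accept the proposed plan if it scores as safe; otherwise
    recommend the best-scoring corrective plan."""
    if q_value(obs, proposed) >= threshold:
        return proposed
    return max(fallbacks, key=lambda p: q_value(obs, p))

obs = rng.normal(size=4)
proposed = rng.normal(size=4)
fallbacks = [rng.normal(size=4) for _ in range(5)]
print(monitor(obs, proposed, fallbacks))
```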
Risk-Averse Planning and Plan Assessment for Marine Robots
Kashani, Mahya Mohammadi, John, Tobias, Coffelt, Jeremy P., Johnsen, Einar Broch, Wasowski, Andrzej
Autonomous Underwater Vehicles (AUVs) need to operate for days without human intervention and thus must be able to do efficient and reliable task planning. Unfortunately, efficient task planning requires deliberately abstract domain models (for scalability reasons), which in practice leads to plans that may be unreliable or underperforming: an optimal abstract plan may turn out suboptimal or unreliable during physical execution. To overcome this, we introduce a method that first generates a selection of diverse high-level plans and then assesses them in a low-level simulation to select the optimal and most reliable candidate. We evaluate the method using a realistic underwater robot simulation, estimating the risk metrics for different scenarios and demonstrating the feasibility and effectiveness of the approach.
- North America > United States (0.14)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- (2 more...)
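A small numpy sketch of the assess-then-select idea, using CVaR as one example of a risk metric; the simulator and the plan parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(plan_risk, plan_cost, n=200):
    """Stand-in low-level simulator: noisy execution costs, with
    heavier tails for riskier plans."""
    return plan_cost + rng.gamma(shape=1.0 + 4 * plan_risk, scale=2.0, size=n)

def cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of outcomes."""
    tail = np.sort(costs)[int(alpha * len(costs)):]
    return tail.mean()

# Two diverse high-level plans: (risk level, nominal cost).
plans = {"short_risky": (0.8, 10.0), "long_safe": (0.1, 14.0)}
scores = {name: cvar(simulate(r, c)) for name, (r, c) in plans.items()}
best = min(scores, key=scores.get)  # risk-averse selection
print(scores, "->", best)
```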
FOSS: A Self-Learned Doctor for Query Optimizer
Zhong, Kai, Sun, Luming, Ji, Tao, Li, Cuiping, Chen, Hong
Various works have utilized deep reinforcement learning (DRL) to address the query optimization problem in database systems. They either learn to construct plans from scratch in a bottom-up manner or guide the plan generation behavior of a traditional optimizer using hints. While these methods have achieved some success, they face challenges of either low training efficiency or a limited plan search space. To address these challenges, we introduce FOSS, a novel DRL-based framework for query optimization. FOSS initiates optimization from the original plan generated by a traditional optimizer and incrementally refines suboptimal nodes of the plan through a sequence of actions. Additionally, we devise an asymmetric advantage model to evaluate the advantage between two plans. We integrate it with a traditional optimizer to form a simulated environment. Leveraging this simulated environment, FOSS can bootstrap itself to rapidly generate a large amount of high-quality simulated experiences. FOSS then learns and improves its optimization capability from these simulated experiences. We evaluate the performance of FOSS on Join Order Benchmark, TPC-DS, and Stack Overflow. The experimental results demonstrate that FOSS outperforms the state-of-the-art methods in terms of latency performance and optimization time. Compared to PostgreSQL, FOSS achieves savings ranging from 15% to 83% in total latency across different benchmarks.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.86)
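A toy sketch of the refine-from-the-optimizer's-plan loop: a join order is improved by local swaps that a stand-in advantage function judges beneficial. The cost model and the advantage stub are illustrative, not FOSS's learned asymmetric advantage model:

```python
import itertools

BASE_PLAN = ["lineitem", "orders", "customer", "nation"]  # optimizer's join order

def plan_cost(order):
    """Toy cost model: joining big tables late is cheaper here."""
    sizes = {"lineitem": 4, "orders": 3, "customer": 2, "nation": 1}
    return sum(sizes[t] * (len(order) - i) for i, t in enumerate(order))

def advantage(new, old):
    """Stand-in for the asymmetric advantage model: > 0 if new beats old."""
    return plan_cost(old) - plan_cost(new)

plan = list(BASE_PLAN)
for step in range(10):  # incrementally refine suboptimal nodes
    swaps = itertools.combinations(range(len(plan)), 2)
    best = max(
        ([*plan[:i], plan[j], *plan[i + 1:j], plan[i], *plan[j + 1:]]
         for i, j in swaps),
        key=lambda p: advantage(p, plan),
    )
    if advantage(best, plan) <= 0:  # no local edit helps; stop refining
        break
    plan = best
print(plan, plan_cost(plan))  # ['nation', 'customer', 'orders', 'lineitem'] 20
```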
Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
Valmeekam, Karthik, Marquez, Matthew, Kambhampati, Subbarao
There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers; moreover, the LLM verifiers produce a notable number of false positives, compromising the system's reliability. Additionally, the nature of feedback, whether binary or detailed, showed minimal impact on plan generation.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.52)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
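The loop the paper evaluates can be sketched as follows; the stub verifiers merely illustrate why a self-critic's false positives hurt reliability, and all the probabilities are invented:

```python
import random

random.seed(0)

def llm_generate(task):
    """Stand-in generator: returns a candidate plan, possibly invalid."""
    return {"valid": random.random() < 0.3}

def sound_verifier(plan):
    """External, sound verifier: never wrong."""
    return plan["valid"]

def llm_self_critic(plan):
    """Overconfident self-critic: passes some invalid plans
    (false positives)."""
    return plan["valid"] or random.random() < 0.4

def solve(task, verify, budget=10):
    for _ in range(budget):
        plan = llm_generate(task)
        if verify(plan):
            return plan
    return None

plan = solve("blocksworld-3", llm_self_critic)
print("accepted plan actually valid:", plan and plan["valid"])
```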
Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments
A key missing capacity of current language models (LMs) is grounding to real-world environments. Most existing work for grounded language understanding uses LMs to directly generate plans that can be executed in the environment to achieve the desired effects. It thereby casts the burden of ensuring grammaticality, faithfulness, and controllability all on the LMs. We propose Pangu, a generic framework for grounded language understanding that capitalizes on the discriminative ability of LMs instead of their generative ability. Pangu consists of a symbolic agent and a neural LM working in a concerted fashion: The agent explores the environment to incrementally construct valid plans, and the LM evaluates the plausibility of the candidate plans to guide the search process. A case study on the challenging problem of knowledge base question answering (KBQA), which features a massive environment, demonstrates the remarkable effectiveness and flexibility of Pangu: A BERT-base LM is sufficient for setting a new record on standard KBQA datasets, and larger LMs further bring substantial gains. Pangu also enables, for the first time, effective few-shot in-context learning for KBQA with large LMs such as Codex.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (18 more...)
- Leisure & Entertainment (1.00)
- Education > Educational Setting (0.46)
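A minimal sketch of the discriminate-don't-generate loop: a symbolic agent enumerates only valid plan extensions, and a scorer ranks them to guide the beam. Keyword overlap stands in for the LM's plausibility score, and the tiny environment is invented:

```python
VALID_EXTENSIONS = {  # stand-in environment: agent-enumerable valid steps
    "": ["find(person)", "find(river)"],
    "find(person)": ["born_in(person)", "spouse(person)"],
    "find(river)": ["length(river)"],
    "born_in(person)": [], "spouse(person)": [], "length(river)": [],
}

def lm_score(question, plan):
    """Stand-in for LM plausibility: keyword overlap with the question."""
    words = question.lower().replace("?", "").split()
    return sum(w in plan for w in words)

def discriminative_search(question, beam_size=2, depth=2):
    beam = [""]
    for _ in range(depth):
        # The agent proposes valid extensions; the LM only discriminates.
        candidates = [c for p in beam for c in VALID_EXTENSIONS[p]] or beam
        beam = sorted(candidates, key=lambda p: lm_score(question, p),
                      reverse=True)[:beam_size]
    return beam[0]

print(discriminative_search("where was this person born?"))  # born_in(person)
```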